Abstract —Speech recognition aims to predict spoken words
automatically. Such systems are known to be very expensive
because they require many hours of pre-recorded speech for
training. Hence, building a model that reduces the cost of the
recognizer is of great interest. In this paper, we present a new
approach to speech recognition based on belief HMMs instead of
probabilistic HMMs. Experiments show that our belief recognizer
is insensitive to the lack of data: it can be trained using only
one example of each acoustic unit and still gives a good
recognition rate. Consequently, using the belief HMM recognizer
can greatly reduce the cost of these systems.

Index Terms —Speech recognition, HMM, belief functions,
belief HMM.

I. Introduction 

Automatic speech recognition is a field of science that
attracts public attention. Indeed, who has never dreamed of
talking with a machine, or at least of controlling an apparatus
or a computer by voice? Speech processing includes two major
disciplines: speech recognition and speech synthesis. Automatic
speech recognition allows the machine to understand and process
oral information provided by a human. It uses matching
techniques to compare a sound wave to a set of samples,
generally composed of words or sub-words. Automatic speech
synthesis, on the other hand, allows the machine to reproduce
the speech sounds of a given text. Nowadays, most speech
recognition systems are based on the modelling of speech units
known as acoustic units. Indeed, speech is composed of a
sequence of elementary sounds which, put together, make up
words. From these units we seek to derive a model (one model
per unit), which is then used to recognize a continuous speech
signal. Hidden Markov Models (HMMs) are very often used to
recognize these units. The HMM based recognizer is a widely
used technique that allows us to recognize about 80% of a given
speech signal, but this recognition rate is not yet
satisfactory. Moreover, this method needs many hours of speech
for training, which makes the automatic speech recognition task
very expensive.

Recently, [1], Q extended the Hidden Markov Model to the
theory of belief functions. The belief HMM avoids the
disadvantages of the probabilistic HMM, which are generally due
to the use of probability theory. Belief functions are used in
several domains of research where uncertainty and imprecision
dominate. They provide many tools for managing and processing
the available pieces of evidence in order to extract knowledge
and make better decisions. They give experts a clearer view of
their problems, which helps in finding better solutions. What’s
more, belief function theories offer more flexible ways to
model uncertain and imprecise data than probability functions.
Finally, they offer many tools with a greater ability to
combine a large number of pieces of evidence.

The belief HMM gives a better classification rate than the
ordinary HMM when both are applied to a classification
problem. Consequently, we propose to use the belief HMM in the
speech recognition process. Finally, we note that this is the
first time that belief functions are used in speech
processing.

In the next section we present the probabilistic hidden
Markov model and define its three well-known problems. In
Section III we present the probabilistic HMM recognizer, the
acoustic model and the recognition process. The transferable
belief model is introduced in Section IV. Section V describes
the belief HMM. In Section VI, we present our belief HMM
recognizer, the belief acoustic model and the belief
recognition process. Finally, experiments are presented in
Section VII.

II. Probabilistic HMM

A Hidden Markov Model is a combination of two stochastic
processes. The first one is a Markov chain characterized by a
finite set $\Omega_t$ of $N$ non-observable (hidden) states and
the transition probabilities
$a_{ij} = P(s_j^t \mid s_i^{t-1})$, $1 \le i, j \le N$,
between them. The second stochastic process produces the
sequence of $T$ observations, which depends on the probability
density function of the observation model, defined as
$b_j(O_t) = P(O_t \mid s_j^t)$, $1 \le j \le N$,
$1 \le t \le T$ *. In this paper we use a mixture of Gaussian
densities. The initial state distribution is defined as
$\pi_i = P(s_i^1)$, $1 \le i \le N$. Hence, an HMM
$\lambda(A, B, \Pi)$ is characterized by the transition matrix
$A = \{a_{ij}\}$, the observation model $B = \{b_j(O_t)\}$ and
the initial state distribution $\Pi = \{\pi_i\}$.

* $t$ denotes the current instant; it is written as a superscript on states for simplicity.
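The observation model based on a mixture of Gaussian densities can be illustrated with a minimal sketch. It assumes diagonal covariance matrices, which the text does not specify; the function names `gaussian_pdf` and `gmm_pdf` and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def gmm_pdf(x, weights, means, variances):
    """b_j(O_t) for one state: weighted sum of Gaussian component densities."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical state with two mixture components on 2-dimensional vectors.
state_gmm = {
    "weights": [0.6, 0.4],
    "means": [[0.0, 0.0], [3.0, 3.0]],
    "variances": [[1.0, 1.0], [2.0, 2.0]],
}
b = gmm_pdf([0.5, -0.2], **state_gmm)
```

In a real recognizer the vectors would be acoustic feature vectors and the mixture parameters would be estimated from data; here they are fixed by hand.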
There are three basic problems of HMMs that must be solved
in order to use these models in real-world applications. The
first is the evaluation problem: it seeks to compute the
probability $P(O \mid \lambda)$ that the observation sequence
$O$ was generated by the model $\lambda$. This probability can
be obtained using the forward propagation 0). Recursively, it
estimates the forward variable:


$$\alpha_t(i) = P(O_1 O_2 \ldots O_t,\ q_t = s_i \mid \lambda) \tag{1}$$

$$\alpha_t(j) = \left( \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right) b_j(O_t) \tag{2}$$


for all states and at all time instants. Then
$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$ is obtained by
summing the terminal forward variables. Backward propagation
can also be used to solve this problem. Unlike forward
propagation, it proceeds backward in time. At each instant, it
calculates the backward variable:


$$\beta_t(i) = P(O_{t+1} O_{t+2} \ldots O_T \mid q_t = s_i, \lambda) \tag{3}$$

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j) \tag{4}$$

Finally, $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)$
is obtained by combining the forward and backward variables.
The second problem is the decoding problem. It seeks to predict
the state sequence $S$ that generated $O$. The Viterbi [4]
algorithm solves this problem. Starting from the first instant
$t = 1$, at each instant $t$ it calculates $\delta_t(j)$ for
every state $j$ and keeps the state that gives the maximum:
$\delta_t(j) = \max_{q_1, \ldots, q_{t-1}} P(q_1, q_2, \ldots, q_{t-1}, q_t = s_j, O_1 O_2 \ldots O_t \mid \lambda)
= \max_{1 \le i \le N} \left( \delta_{t-1}(i)\, a_{ij} \right) b_j(O_t)$.
When the algorithm reaches the last instant $t = T$, it keeps
the state that maximizes $\delta_T$. Finally, the Viterbi
algorithm backtracks the sequence of states as indicated by the
pointer stored at each instant $t$. The last problem is the
learning problem: it seeks to adjust the model parameters in
order to maximize $P(O \mid \lambda)$. The Baum-Welch Q method
is widely used. This algorithm uses the forward and backward
variables to re-estimate the model parameters.
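The forward and Viterbi recursions described in this section can be sketched as follows. This is a minimal illustration and not the implementation used in the paper: it assumes the emission values $b_j(O_t)$ are already available in a table `B[t][j]`, it uses the standard initialization $\pi_j\, b_j(O_1)$ (not stated in the text), and the toy numbers are hypothetical.

```python
def forward(pi, A, B):
    """Return P(O | lambda) by the forward recursion; B[t][j] holds b_j(O_t)."""
    n = len(pi)
    alpha = [pi[j] * B[0][j] for j in range(n)]          # assumed initialization
    for t in range(1, len(B)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[t][j]
                 for j in range(n)]
    return sum(alpha)                                     # sum of terminal values

def viterbi(pi, A, B):
    """Return the most likely state sequence (as state indices)."""
    n = len(pi)
    delta = [pi[j] * B[0][j] for j in range(n)]
    pointers = []
    for t in range(1, len(B)):
        ptr, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            ptr.append(best_i)                            # best predecessor of j
            new_delta.append(delta[best_i] * A[best_i][j] * B[t][j])
        pointers.append(ptr)
        delta = new_delta
    state = max(range(n), key=lambda j: delta[j])         # best final state
    path = [state]
    for ptr in reversed(pointers):                        # backtracking
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical two-state model and three observations.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
likelihood = forward(pi, A, B)
best_path = viterbi(pi, A, B)
```

For long observation sequences the products underflow, so practical implementations usually work with scaled variables or in the log domain.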

III. Probabilistic HMM based recognizer

A. Acoustic model

The acoustic model attempts to mimic the human auditory
system. It is the model used by the HMM-based speech recognizer
to transform the speech signal into a sequence of acoustic
units; this sequence is then transformed into a phoneme
sequence, and finally the desired text is generated by
converting the phoneme sequence into text. Acoustic models are
used by speech segmentation and speech recognition systems.

The acoustic model is composed of a set of HMMs a; each HMM
corresponds to an acoustic unit. To obtain a good acoustic
model, some choices have to be made:

a) The acoustic unit: the choice of the acoustic unit is
very important; in fact, the number of units influences the
complexity of the model (the larger the number, the more
complex the model). If we choose a small unit like the phone,
we will have an HMM for every possible phone in the language.
The problem with this choice is that the phone does not model
its context. Such a model is called a context independent
model. These models are generally used for speech segmentation
systems. Other units that take the context into account can be
used as acoustic units, such as the diphone, which models the
transition between two phones, the triphone, which models the
transition between three phones, subwords, and words. These
models are called context dependent models. According to Is),
the larger the context, the better the recognition
performance.

b) The model: we associate an HMM with each acoustic unit;
the type of HMM and the probability density function of the
observations must then be chosen. Generally, left-right models
are used for speech recognition and speech synthesis systems
i). In fact, the speech signal changes over time, so the choice
of the left-right model is justified by the fact that there
are no backward transitions and all transitions go forward. The
number of states is fixed in advance or chosen experimentally.
na, a fixed the number of states to three. This choice is
justified by the fact that the acoustic realization of most
phonemes is characterized by three sub-segments, hence we have
one state for each sub-segment. ID, iia used an HMM of six
states. Finally, we choose the probability density function of
the observations. It is represented by a mixture of Gaussian
pdfs; the number of mixtures is generally chosen
experimentally.

The next step consists in training the parameters of each HMM
using a speech corpus that contains many examples of each
acoustic unit. Speech segments are transformed into sequences
of acoustic vectors by means of a feature extraction method
such as MFCC; these acoustic vectors are our sequences of
observations.

Then, the HMMs are concatenated and we obtain the model that
will be used to recognize a new speech signal. The recognizer
contains three levels. The first is the syntactic level: it
represents all possible word sequences that can be recognized
by our model. The second is the lexical level: it represents
the phonetic transcription (the phoneme sequence) of each word.
Finally, the third is the acoustic level: it models the
realization of each acoustic unit (in this case the phone).


B. Speech recognition process 

The model described above is used for the speech recognition
process. Let S be the speech signal to be recognized.
Recognizing S consists in finding the most likely path in the
syntactic network. The first step is to transform S into a
sequence of acoustic vectors using the same feature extraction
method used for training; we thus obtain our sequence of
observations O. The most likely path is the path that maximizes
the probability of observing O given the model,
$P(O \mid \lambda)$. This probability can be computed using
either the forward algorithm or the Viterbi algorithm.
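As a simplified illustration of this process, the sketch below scores one observation sequence against one left-right HMM per acoustic unit and keeps the unit with the highest forward likelihood. It deliberately ignores the syntactic and lexical levels, uses discrete toy observations instead of acoustic vectors, and all names (`table_emitter`, `log_likelihood`, `recognize`) and numbers are hypothetical.

```python
import math

def table_emitter(table):
    """Build an emission function b_j(o) from a table indexed as table[j][o]."""
    return lambda j, o: table[j][o]

def log_likelihood(pi, A, emit, observations):
    """log P(O | lambda) computed with the forward recursion."""
    n = len(pi)
    alpha = [pi[j] * emit(j, observations[0]) for j in range(n)]
    for o in observations[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * emit(j, o)
                 for j in range(n)]
    return math.log(sum(alpha))

def recognize(models, observations):
    """Return the unit whose model gives the highest likelihood of O."""
    return max(models,
               key=lambda unit: log_likelihood(*models[unit], observations))

# Left-right topology: no transition goes back to an earlier state.
left_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "a": ([1.0, 0.0], left_right, table_emitter([[0.9, 0.1], [0.8, 0.2]])),
    "b": ([1.0, 0.0], left_right, table_emitter([[0.1, 0.9], [0.2, 0.8]])),
}
recognized = recognize(models, [0, 0, 1, 0])
```

The zeros below the diagonal of `left_right` are what encode the absence of backward transitions.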

IV. Transferable Belief Model 
The Transferable Belief Model (TBM) HD, QO) is a widely
used variant of belief function theories. It is a more general
system than the Bayesian model.
Let $\Omega_t = \{\omega_1, \omega_2, \ldots, \omega_n\}$ be our
frame of discernment. The agent's belief on $\Omega_t$ is
represented by the basic belief assignment (BBA)
$m^{\Omega_t}$, defined from $2^{\Omega_t}$ to $[0,1]$.
$m^{\Omega_t}(A)$ is the mass value assigned to the proposition
$A \subseteq \Omega_t$, and it must satisfy
$\sum_{A \subseteq \Omega_t} m^{\Omega_t}(A) = 1$. We can also
define conditional BBAs: $m^{\Omega_t}[S^{t-1}](A)$ is a BBA
defined conditionally to $S^{t-1} \subseteq \Omega_{t-1}$. If
$m^{\Omega_t}(\emptyset) > 0$, the BBA can be normalized by
dividing the other masses by $1 - m^{\Omega_t}(\emptyset)$; the
conflict mass is then redistributed and
$m^{\Omega_t}(\emptyset) = 0$.
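The normalization step can be sketched in a few lines. The representation below (a dictionary mapping `frozenset` focal elements to masses) and the function name `normalize` are assumptions made for illustration only.

```python
def normalize(m):
    """Divide every non-empty focal element by 1 - m(emptyset) and drop the
    mass of the empty set, so that the conflict mass is redistributed."""
    conflict = m.get(frozenset(), 0.0)
    if conflict >= 1.0:
        raise ValueError("total conflict: the BBA cannot be normalized")
    return {A: v / (1.0 - conflict) for A, v in m.items() if A}

# Hypothetical BBA on the frame {a, b} with some mass on the empty set.
m = {
    frozenset(): 0.2,
    frozenset({"a"}): 0.5,
    frozenset({"a", "b"}): 0.3,
}
m_normalized = normalize(m)
```

After normalization the masses of the non-empty subsets sum to one and the empty set carries no mass.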

Basic belief assignments can be converted into other
functions that represent the same information in other forms.
What’s more, they are in one-to-one correspondence and are
defined from $2^{\Omega_t}$ to $[0,1]$. We will use the belief
$bel$, plausibility $pl$ and commonality $q$ functions:


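$$bel^{\Omega_t}(A) = \sum_{\emptyset \ne B \subseteq A} m^{\Omega_t}(B), \quad \forall A \subseteq \Omega_t,\ A \ne \emptyset \tag{5}$$

$$m^{\Omega_t}(A) = \sum_{B \subseteq A} (-1)^{|A| - |B|}\, bel^{\Omega_t}(B), \quad \forall A \subseteq \Omega_t \tag{6}$$

$$pl^{\Omega_t}(A) = \sum_{B \cap A \ne \emptyset} m^{\Omega_t}(B), \quad \forall A \subseteq \Omega_t \tag{7}$$

$$m^{\Omega_t}(A) = \sum_{B \subseteq A} (-1)^{|A| - |B| - 1}\, pl^{\Omega_t}(\overline{B}), \quad \forall A \subseteq \Omega_t \tag{8}$$

$$q^{\Omega_t}(A) = \sum_{B \supseteq A} m^{\Omega_t}(B), \quad \forall A \subseteq \Omega_t \tag{9}$$

$$m^{\Omega_t}(A) = \sum_{A \subseteq B} (-1)^{|B| - |A|}\, q^{\Omega_t}(B), \quad \forall A \subseteq \Omega_t \tag{10}$$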
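The conversions between a BBA and its belief, plausibility and commonality functions can be checked numerically. The following sketch uses a dictionary of `frozenset` focal elements, which is an assumed representation; the helper names are hypothetical. The last function inverts the commonality function back to masses by the alternating sum over supersets.

```python
from itertools import combinations

def powerset(frame):
    """All subsets of the frame of discernment, as frozensets."""
    items = sorted(frame)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def bel(m, A):
    """Sum of the masses of the non-empty subsets of A."""
    return sum(v for B, v in m.items() if B and B <= A)

def pl(m, A):
    """Sum of the masses of the sets that intersect A."""
    return sum(v for B, v in m.items() if B & A)

def q(m, A):
    """Sum of the masses of the supersets of A."""
    return sum(v for B, v in m.items() if A <= B)

def mass_from_q(q_values, frame):
    """Recover the BBA from commonalities (alternating sum over supersets)."""
    subsets = powerset(frame)
    return {A: sum((-1) ** (len(B) - len(A)) * q_values[B]
                   for B in subsets if A <= B)
            for A in subsets}

frame = {"a", "b"}
m = {frozenset(): 0.1, frozenset({"a"}): 0.4, frozenset({"a", "b"}): 0.5}
q_values = {A: q(m, A) for A in powerset(frame)}
m_back = mass_from_q(q_values, frame)
```

Converting to commonalities and back returns the original masses, which illustrates the one-to-one correspondence mentioned above.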


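Consider two distinct BBAs $m_1^{\Omega_t}$ and $m_2^{\Omega_t}$
defined on $\Omega_t$; we can obtain $m_{1 \cap 2}^{\Omega_t}$
through the TBM conjunctive rule (also called conjunctive rule
of combination, CRC) ID as:

$$m_{1 \cap 2}^{\Omega_t}(A) = \sum_{B \cap C = A} m_1^{\Omega_t}(B)\, m_2^{\Omega_t}(C), \quad \forall A \subseteq \Omega_t \tag{11}$$

Equivalently, we can calculate the CRC via a simpler
expression defined with the commonality function:

$$q_{1 \cap 2}^{\Omega_t}(A) = q_1^{\Omega_t}(A)\, q_2^{\Omega_t}(A), \quad \forall A \subseteq \Omega_t \tag{12}$$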
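A small sketch can make the conjunctive rule and its commonality form concrete. The dictionary-of-`frozenset` representation and the function names are assumptions for illustration, and the two BBAs are hypothetical.

```python
def conjunctive(m1, m2):
    """Unnormalized conjunctive combination: the product of masses goes to the
    intersection of the focal elements (possibly the empty set)."""
    combined = {}
    for B, v1 in m1.items():
        for C, v2 in m2.items():
            A = B & C
            combined[A] = combined.get(A, 0.0) + v1 * v2
    return combined

def commonality(m, A):
    """Sum of the masses of the supersets of A."""
    return sum(v for B, v in m.items() if A <= B)

m1 = {frozenset({"a"}): 0.6, frozenset({"a", "b"}): 0.4}
m2 = {frozenset({"b"}): 0.7, frozenset({"a", "b"}): 0.3}
m12 = conjunctive(m1, m2)

# The commonality of the combination equals the product of commonalities.
A = frozenset({"a"})
lhs = commonality(m12, A)
rhs = commonality(m1, A) * commonality(m2, A)
```

In this example the two sources partly disagree, so the combination assigns a mass of 0.42 to the empty set; this conflict mass is the quantity exploited later by the belief HMM.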


V. Belief HMM 

The belief HMM is an extension of the probabilistic HMM to
belief functions [1], Q, [8]. Like the probabilistic HMM, the
belief HMM is a combination of two stochastic processes.
Hence, a belief HMM is characterized by:

• The credal transition matrix
$A = \{m_a^{\Omega_t}[S_i^{t-1}](S_j^t)\}$, a set of BBA
functions defined conditionally to all possible subsets of
states $S^{t-1}$,

• The observation model $B = \{m_b^{\Omega_t}[O_t](S_j^t)\}$,
a set of BBA functions defined conditionally to the set of
possible observations $O_t$,

• The initial state distribution
$\Pi = \{m_\pi^{\Omega_1}(S_i^1)\}$.

The three basic problems of HMMs and their solutions are
extended to belief functions. As we know, the forward
algorithm solves the evaluation problem in the probabilistic
case. 121 introduced the credal forward algorithm in order to


solve this problem in the evidential case. It needs as inputs
$m_a^{\Omega_t}[S_i^{t-1}](S_j^t)$ and
$m_b^{\Omega_t}[O_t](S_j^t)$ to calculate the forward
commonality:

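$$q_\alpha^{\Omega_{t+1}}(S_j^{t+1}) = \left( \sum_{S_i^t \subseteq \Omega_t} m_\alpha^{\Omega_t}(S_i^t)\, q_a^{\Omega_{t+1}}[S_i^t](S_j^{t+1}) \right) q_b^{\Omega_{t+1}}[O_{t+1}](S_j^{t+1}) \tag{13}$$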

This commonality is calculated recursively from $t = 1$ to
$T$. H exploits the conflict of the forward BBA (obtained by
using formula (10)) to define an evaluation metric. This metric
can be used for classification, to choose the model that best
fits the observation sequence, or to evaluate a model. Given a
model $\lambda$ and an observation sequence of length $T$, the
conflict metric is defined by:

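$$L_c(\lambda) = \frac{1}{T} \sum_{t=1}^{T} \log\left(1 - m_\alpha^{\Omega_{t+1}}[\lambda](\emptyset)\right) \tag{14}$$

$$\lambda^* = \arg\max_{\lambda} L_c(\lambda) \tag{15}$$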
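The use of the conflict metric for model selection can be sketched as follows. The sketch assumes that the metric is the time average of $\log(1 - m(\emptyset))$ over the forward BBAs and that the masses on the empty set have already been produced by a credal forward pass, which is not implemented here; function names and numbers are hypothetical.

```python
import math

def conflict_metric(empty_masses):
    """Average of log(1 - conflict) over the T forward BBAs of one model."""
    return sum(math.log(1.0 - c) for c in empty_masses) / len(empty_masses)

def best_model(conflicts_by_model):
    """Return the model name with the largest conflict metric."""
    return max(conflicts_by_model,
               key=lambda name: conflict_metric(conflicts_by_model[name]))

# Hypothetical conflict masses at each instant for two candidate models.
conflicts_by_model = {
    "model_1": [0.10, 0.20, 0.15],
    "model_2": [0.50, 0.60, 0.55],
}
selected = best_model(conflicts_by_model)
```

A model that generates little conflict with the observations has a metric close to zero, while a badly fitting model has a strongly negative one, which is why the maximum is selected.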

A credal backward algorithm is also defined; recursively, it
calculates the backward commonality from $T$ to $t = 1$. More
details can be found in m, 0. For the decoding problem, many
solutions have been proposed to extend the Viterbi algorithm to
the TBM 171, 0, 0. All of them seek to maximize the state
sequence plausibility. According to the definition given in 0,
the plausibility of a sequence of singleton states
$S = \{s^1, s^2, \ldots, s^T\}$, $s^t \in \Omega_t$, is given
by:


$$pl_\delta(S) = pl_\pi(s^1) \cdot \prod_{t=2}^{T} pl_a^{\Omega_t}[s^{t-1}](s^t)\, pl_b(s^t) \tag{16}$$


Hence, we can choose the best state sequence by maximizing
this plausibility. For the learning problem, 0, 0 have
proposed some solutions to estimate model parameters; we
describe the method used in this paper. The first step consists
in estimating the parameters of the Gaussian mixture models
(GMM) using the Expectation-Maximization (EM) algorithm. For
each state we estimate one GMM. These models are used to
calculate $m_b^{\Omega_t}[O_t](S_j^t)$. 0 proposes to estimate
the credal transition matrix independently from the
transitions themselves. He uses the observation BBAs as:


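$$\hat{m}_a^{\Omega_t \times \Omega_{t+1}} \propto \frac{1}{T} \sum_{t=1}^{T-1} \left( m_b^{\Omega_t}[O_t]^{\uparrow \Omega_t \times \Omega_{t+1}} \cap m_b^{\Omega_{t+1}}[O_{t+1}]^{\uparrow \Omega_t \times \Omega_{t+1}} \right) \tag{17}$$

where $m_b^{\Omega_t}[O_t]^{\uparrow \Omega_t \times \Omega_{t+1}}$
and $m_b^{\Omega_{t+1}}[O_{t+1}]^{\uparrow \Omega_t \times \Omega_{t+1}}$
are computed using the vacuous extension operator a of the BBA
$m_b^{\Omega_t}[O_t](S_j^t)$ on the cartesian product space as:

$$m_b^{\Omega_t \uparrow \Omega_t \times \Omega_{t+1}}(B) =
\begin{cases}
m_b^{\Omega_t}(A) & \text{if } B = A \times \Omega_{t+1} \\
0 & \text{otherwise}
\end{cases} \tag{18}$$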


This estimation formula is used by 0 as an initialization
for the ITS (Iterative Transition Specialization) algorithm.
ITS is an iterative algorithm that uses the credal forward
algorithm to improve the estimate of the credal transition
matrix. It stops when the conflict metric (formula (14)) has
converged.
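To illustrate the vacuous extension used in this estimation, the sketch below extends two observation BBAs defined at consecutive instants to the product space and combines them conjunctively. This corresponds to a single term of the sum over time and is not the full estimation procedure or the ITS algorithm; the representation (subsets of the product space as `frozenset`s of pairs) and all names and numbers are hypothetical.

```python
from itertools import product

def extend_first(m, omega_next):
    """Vacuous extension of a BBA on Omega_t: A becomes A x Omega_{t+1}."""
    return {frozenset(product(A, omega_next)): v for A, v in m.items()}

def extend_second(m, omega_prev):
    """Vacuous extension of a BBA on Omega_{t+1}: B becomes Omega_t x B."""
    return {frozenset(product(omega_prev, B)): v for B, v in m.items()}

def conjunctive(m1, m2):
    """Unnormalized conjunctive combination on a common frame."""
    combined = {}
    for B, v1 in m1.items():
        for C, v2 in m2.items():
            combined[B & C] = combined.get(B & C, 0.0) + v1 * v2
    return combined

states = frozenset({1, 2})
m_t = {frozenset({1}): 0.8, states: 0.2}       # observation BBA at instant t
m_t1 = {frozenset({2}): 0.7, states: 0.3}      # observation BBA at instant t+1

joint = conjunctive(extend_first(m_t, states), extend_second(m_t1, states))
```

Here the largest joint mass (0.56) falls on the single pair (1, 2), that is, on the transition from state 1 to state 2.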
VI. Belief HMM based recognizer

Our goal is to create a speech recognizer using the belief
HMM instead of the probabilistic HMM. The HMM recognizer uses
an acoustic model to recognize the content of the speech
signal. We therefore seek to mimic this model in order to
create a belief HMM based one. We should note that the existing
parameter estimation methods for the belief HMM cannot be used
to estimate model parameters from multiple observation
sequences. This fact should be taken into account when we
design our belief acoustic model.



A. Belief acoustic model 

In the probabilistic case, we use an HMM for each acoustic
unit; its parameters are trained using multiple speech
realizations of the unit 0, 0, 0, 112, 0. In the credal case, a
similar model cannot be used. Hence, we present an alternative
method that takes this fact into account.

Let $K$ be the number of speech realizations of a given
acoustic unit. These speech realizations are transformed into
MFCC feature vectors. Hence, we obtain $K$ observation
sequences. Our training set is $O = [O^1, O^2, \ldots, O^K]$,
where $O^k = (O_1^k, O_2^k, \ldots, O_{T_k}^k)$ is the $k$-th
observation sequence, of length $T_k$. These observations are
assumed to be independent of each other. So instead of training
one model for the whole observation set $O$, we propose to
create a belief model for each observation sequence $O^k$.
These $K$ models are used to represent the given acoustic unit
in the recognition process.

As with the acoustic model based on the probabilistic HMM,
we have to make some choices in order to obtain a good belief
acoustic model. First, we choose the acoustic unit. The same
choices as in the probabilistic case can be adopted for the
belief case. Second, we choose the model. We should note that
we cannot choose the topology of the belief HMM; this is due to
the estimation process of the credal transition matrix. In
other words, the resulting credal observation model is used to
estimate the credal transition matrix, which does not allow us
to choose the topology of the resulting model. Consequently,
choosing the model in the credal case consists in choosing the
number of states and the number of Gaussian mixtures. In our
case we fix the number of states to three and choose the number
of Gaussian mixtures experimentally.


B. Speech recognition process 

The belief acoustic model is used in the speech recognition
process. We now explain how the resulting model is used to
recognize a speech signal.

Figure 1. Influence of the number of observations on the recognition rate

Let S be the speech signal to be recognized. Recognizing S
consists in finding the most likely set of models. The first
step is to transform S into a sequence of acoustic vectors
using the same feature extraction method used for training; we
thus obtain our sequence of observations O. This sequence is
used as input for all models. The credal forward algorithm is
then applied, and each model outputs the value of the conflict
metric. An acoustic unit is represented by a set of models, and
every model gives a value for the conflict metric. We then
calculate the arithmetic mean of the resulting values. Finally,
we choose the set of models that optimizes the average of the
conflict metric instead of optimizing the conflict metric, as
proposed by 0, using formula
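The decision step can be sketched as follows, assuming that the conflict metric of every model has already been computed by the credal forward algorithm (not implemented here) and that "optimizing" means taking the maximum of the averaged metric. The function name `recognize_unit` and the scores are hypothetical.

```python
def recognize_unit(metrics_by_unit):
    """Average the conflict metrics of the K models of each acoustic unit and
    return the unit with the highest average, together with all averages."""
    averages = {unit: sum(values) / len(values)
                for unit, values in metrics_by_unit.items()}
    return max(averages, key=averages.get), averages

# Hypothetical conflict metrics: three models (K = 3) per acoustic unit.
metrics_by_unit = {
    "unit_1": [-0.20, -0.35, -0.25],
    "unit_2": [-1.10, -0.90, -1.30],
}
recognized, averages = recognize_unit(metrics_by_unit)
```

Averaging over the K models of a unit lets each training realization contribute one vote of equal weight, which is how the set of single-sequence models stands in for one model trained on several sequences.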

VII. Experiments

In this section we present experiments in order to validate 
our approach. We compare our belief HMM recognizer to a 
similar one implemented using the probabilistic HMM. 

We use MFCC (Mel Frequency Cepstral Coefficients) as feature
vectors. We also use a three-state HMM and two Gaussian
mixtures. Finally, to evaluate our models we calculate the
percentage of correctly recognized acoustic units (number of
correctly recognized acoustic units / total number of acoustic
units). We use a speech corpus that contains speech
realizations of seven different acoustic units, with fifteen
examples of each one. Results are shown in Figure 1.
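The evaluation measure is a simple ratio, sketched below for completeness. The labels are hypothetical and do not reproduce the corpus or the results of the experiments.

```python
def recognition_rate(predicted, reference):
    """Percentage of acoustic units that were recognized correctly."""
    if len(predicted) != len(reference):
        raise ValueError("both label lists must have the same length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)

# Hypothetical test set of seven units, one of which is misrecognized.
reference = ["u1", "u2", "u3", "u4", "u5", "u6", "u7"]
predicted = ["u1", "u2", "u3", "u4", "u5", "u6", "u1"]
rate = recognition_rate(predicted, reference)
```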

The lack of training data for the probabilistic HMM leads to
very poor learning, and the resulting acoustic model cannot be
effective. Thus, a training set that contains only one example
of each acoustic unit yields a poor probabilistic recognizer.
In this case our belief HMM based recognizer gives a
recognition rate of 85.71%, against 13.79% for the
probabilistic HMM, which is trained using HTK 112. This result
shows that the belief HMM recognizer is insensitive to the lack
of data and that we can obtain a good belief acoustic model
using only one observation for each unit. In fact, the belief
HMM models knowledge by taking into account doubt, imprecision
and conflict, which leads to a discriminative model when data
is lacking.

HTK is a toolkit for HMMs, optimized for the HMM speech
recognition process. It is known to be powerful provided that
many examples of each acoustic unit are available. Hence, it
needs several hours of speech for training. Building a good
speech corpus is very expensive, which drives up the cost of
the recognition system. Consequently, using the belief HMM
recognizer can greatly reduce the cost of these systems.
VIII. Conclusion

In this paper, we proposed the belief HMM recognizer. We
showed that incorporating belief function theory into the
speech recognition process is very beneficial; in fact, it
considerably reduces the cost of the speech recognition system.
Future work will focus on the case of noisy speech signals.
Indeed, existing speech recognizers are not yet good when the
signal to be decoded is noisy.